Abstract

Multilingual  OCR  has  emerged  as  an  important  information  technology,  thanks  to  the 
increasing  need  for  cross-language  information  access.  While  many  research  groups  and 
companies have developed OCR algorithms for various languages, it is difficult to compare
the performance of these OCR algorithms across languages. This difficulty arises because
most evaluation methodologies rely on the use of a document image dataset in each of
the languages and it is difficult to find document datasets in different languages that are
similar  in  content  and  layout. 

In  this  paper  we  propose  to  use  the  Bible  as  a  dataset  for  comparing  OCR  accuracy 
across  languages.  Besides  being  available  in  a  wide  range  of  languages,  Bible  translations 
are  closely  parallel  in  content,  carefully  translated,  surprisingly  relevant  with  respect  to 
modern-day  language,  and  quite  inexpensive.  A  project  at  the  University  of  Maryland 
is  currently  implementing  this  idea.  We  have  created  a  scanned  image  dataset  with 
groundtruth  from  an  Arabic  Bible.  We  have  also  used  image  degradation  models  to 
create  synthetically  degraded  images  of  a  French  Bible.  We  hope  to  generate  similar 
Bible  datasets  for  other  languages,  and  we  are  exploring  alternative  corpora  such  as  the 
Koran  and  the  Bhagavad  Gita  that  have  similar  properties.  Quantitative  OCR  evaluation 
based  on  the  Arabic  Bible  dataset  is  currently  in  progress. 


This  research  was  funded  in  part  by  the  Department  of  Defense  and  the  Army  Research  Laboratory 
under  Contract  MDA9049-6C-1250. 


1  Introduction 


To  evaluate  any  OCR  algorithm  we  need  (i)  datasets  of  scanned  document  images  and  (ii) 
the  corresponding  symbolic  groundtruth.  Obtaining  manually  generated  groundtruth,  as 
is  generally  done,  is  labor-intensive,  time-consuming,  prohibitively  expensive,  and  prone 
to  errors.  Furthermore,  collecting  multilingual  datasets  that  have  similar  content  across 
languages  is  a  non-trivial  task. 

In  this  paper  we  propose  to  use  the  Bible  as  a  corpus  for  evaluating  OCR  accuracy 
across  languages.  As  an  example,  a  French  version  of  the  Bible  provides  us  with  a  test  set 
and  groundtruth  for  a  French  OCR  system,  with  far  less  effort  than  one  would  typically 
expend  in  obtaining  groundtruth  for  a  1000-page  text.  In  addition,  using  the  French 
dataset  also  allows  us  to  make  a  far  more  controlled  comparison  to  our  Arabic  OCR 
system  than  if  we  were  using  groundtruth  from  an  unrelated  text.  At  the  University  of 
Maryland, we have collected the electronic groundtruth for Arabic, English, and French
Bibles  and  have  scanned  the  paper  version  of  the  Arabic  Bible.  We  have  also  obtained, 
but  not  processed  as  datasets,  versions  of  complete  Bibles  or  New  Testaments  in  17  other 
languages.  All  our  datasets  will  be  made  freely  available  to  OCR  researchers. 

In  Section  2  we  give  a  survey  of  OCR  evaluation  methods  and  programs  that  have 
been  pursued  in  the  past  and  discuss  their  limitations.  In  Section  3  we  explain  why  the 
Bible is a good corpus for the purposes of OCR evaluation. In Section 4 we report on a project
[KKB98] recently started at the University of Maryland for creating Bible datasets with
groundtruth  in  various  languages.  Finally,  in  Section  5  we  describe  how  the  Bible  corpus 
can  also  be  used  for  generating  synthetically  degraded  multilingual  datasets. 

2  Performance  Evaluation  Methodologies 

Numerous commercial OCR systems claim that their products have near-perfect recognition
accuracy (close to 99.9%). In practice, however, these accuracy rates are rarely
achieved. Most systems break down when the input document images are highly degraded,
such as scanned images of carbon-copy documents, documents printed on low-quality
paper, and documents that are n-th generation photocopies.

Characterizing  the  performance  of  OCR  systems  is  important  for  many  reasons: 

• Predict performance: Typically OCR is part of a bigger system, e.g., an information
retrieval (IR) system or a machine translation (MT) system. Since the overall
performance depends on the performances of the individual subsystems, the overall
performance of the MT/IR system is a function of the OCR recognition rate.
Knowledge of end-to-end performance as a function of OCR accuracy rate will allow
us to predict the minimum recognition rate required for achieving a specified
overall MT/IR system performance rate.

• Monitor progress: In order to monitor progress in research/development of OCR
systems, we need quantitative measures. Periodic quantitative performance evaluation
of OCR systems will allow us to assess progress in the field.

• Provide scientific explanation: Understanding the contributions to accuracy improvement
by specific submodules, that is, explaining why an OCR system achieves
a particular accuracy.

•  Identify  open  problems:  Determining  areas  that  need  improvement/research. 

OCR  evaluation  can  be  broadly  categorized  into  two  types:  i)  blackbox  evaluation 
and  ii)  whitebox  evaluation.  In  blackbox  evaluation  an  entire  OCR  system  is  treated  as 
an  indivisible  unit  and  the  end-to-end  performance  of  the  system  is  characterized.  The 
performance  of  the  system  is  evaluated  as  follows.  First  a  corpus  of  scanned  document 
images  is  selected.  Next,  the  text  zones  are  delineated.  Then,  for  each  text  zone,  the 
correct  text  string  is  keyed  in  by  humans.  The  process  of  delineating  the  zones  and  keying 
in  the  text  is  very  laborious,  expensive,  and  prone  to  errors.  Finally,  the  OCR  algorithm 
is  run  on  each  text  zone  and  the  results  are  compared  with  the  keyed-in  groundtruth  text 
using  a  string  matching  routine.  In  theory  the  corpus  should  be  a  representative  sample 
of  the  population  of  images  for  which  the  algorithm  was  designed.  In  practice,  however, 
factors like time and cost force us to limit the size of the dataset to something feasible.
This process was adopted by the UNLV OCR evaluation program [RJN96], the UW evaluation
process [CSHP94], and the UMD Arabic OCR evaluation process [KMB99]. The
UNLV evaluation corpus consisted of English annual reports, documents from the Department
of Energy, magazines, business letters, legal documents, Spanish newspapers, and
German business letters. The UW dataset [HP+] consisted of English technical journals.
The UMD evaluation used the DARPA/SAIC Arabic dataset [DH97], which consists of
book chapters, magazine articles, and computer-generated documents. The UNLV evaluation
reported average character and word accuracies, while the UW and UMD evaluations
reported average character accuracy. Since the English, Spanish, German and Arabic
datasets  have  very  different  content  and  layout,  comparing  accuracies  across  languages 
is  not  very  meaningful. 
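
The string-matching step of a blackbox evaluation can be illustrated with a short sketch. The Python code below is not the UNLV, UW, or UMD scoring software. It assumes that character accuracy is defined as (n - e)/n, where n is the number of groundtruth characters and e is the Levenshtein edit distance between the groundtruth and the OCR output, which is one common convention. The function names are hypothetical.

```python
def edit_distance(truth: str, ocr: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions separating the two strings."""
    prev = list(range(len(ocr) + 1))
    for i, t in enumerate(truth, start=1):
        curr = [i]
        for j, o in enumerate(ocr, start=1):
            curr.append(min(
                prev[j] + 1,              # groundtruth character missing from OCR output
                curr[j - 1] + 1,          # spurious character in OCR output
                prev[j - 1] + (t != o),   # substitution, or a match at no cost
            ))
        prev = curr
    return prev[-1]


def character_accuracy(truth: str, ocr: str) -> float:
    """Assumed definition: (n - errors) / n, with n the groundtruth length."""
    if not truth:
        raise ValueError("groundtruth must be non-empty")
    return (len(truth) - edit_distance(truth, ocr)) / len(truth)
```

Under this definition the accuracy can become negative when the OCR output contains many spurious characters, so a real scoring tool may clip or report raw error counts instead.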

Whitebox  evaluation,  on  the  other  hand,  characterizes  the  performance  of  individual 
submodules.  Most  OCR  systems  have  submodules  for  skew  detection  and  correction,  page 
segmentation, zone classification, and text extraction. Zone segmentation evaluation has
been  attempted  earlier  by  Vincent  et  al.  [YV98,  RV94].  Whitebox  evaluation  is  possible 
only  if  the  evaluator  has  access  to  the  input  and  output  of  the  submodules  of  the  OCR 
system.  Thus  for  segmentation  evaluation,  access  to  the  coordinates  of  zones  produced 
by  OCR  is  crucial.  While  blackbox  evaluation  does  not  require  access  to  intermediate 
results,  it  does  not  provide  performance  analysis  at  the  submodule  level.  Furthermore,  the 
blackbox  evaluations  described  above  do  not  take  into  account  errors  due  to  segmentation. 

More  recently,  researchers  have  advocated  the  use  of  synthetically  generated  data 
for OCR evaluation. In this methodology (see Kanungo et al. [KHP94, Kan96]), documents
are first typeset using a standard typesetting system such as LaTeX or Word.
Then  a  noise-free  bitmap  image  of  the  document  and  the  corresponding  groundtruth  is 
automatically  generated.  The  noise-free  bitmap  is  then  degraded  using  a  parametrized 
degradation model [Bai92, KHP94, Kan96]. The degradation level is controlled by varying
the parameters of the model. This methodology has the advantage that the laborious
process of manually typing in the data is completely avoided. Furthermore, no manual
scanning is required, and the process is entirely independent of language (up to the limits
of  the  typesetting  software).  Since  the  typesetting  software  is  available  to  us,  the  effects  of 
page  layout,  font  size  and  type  on  OCR  accuracy  can  be  studied  by  conducting  controlled 
experiments.  A  variant  of  the  above  methodology  proposed  by  Kanungo  and  Haralick 
[KH98,  Kan96]  generates  real  degradations  by  printing  the  ideal  document,  scanning  it, 
and then transforming the ideal groundtruth to match the real image. This process allows
a researcher to generate groundtruth at a geometric level (character bounding boxes,
identity, font, etc.) in any language, which is essential for building classifiers.

Although  the  degradation  model  can  be  applied  to  documents  in  any  language,  the 
contents  of  the  documents  in  the  corpus  again  become  crucial  if  we  need  to  compare  OCR 
accuracies  across  languages.  Otherwise,  the  comparisons  are  not  very  meaningful.  In  the 
next  section  we  propose  to  use  the  Bible  as  a  source  of  such  documents  and  discuss  why 
it  is  a  good  dataset,  as  well  as  its  potential  limitations. 

3  The  Bible  as  a  Corpus 

Text  corpora  —  bodies  of  naturally  occurring  text  —  have  in  the  past  5-10  years  become 
a focal point of research in computational linguistics [CM93]. The Bible, at first blush,
seems like an unlikely resource for research in language technology, conjuring up images
of archaic syntax, atypical vocabulary, and religion-specific subject matter. However, as
Resnik, Olsen, and Diab [RODss] discuss, the Bible turns out to be surprisingly relevant
for research involving present-day language, if one begins with a modern language translation
such as the New International Version (NIV) for English. Resnik et al. evaluate
the vocabulary of the NIV against two benchmarks: the approximately 2200-word control
vocabulary for Longman’s Dictionary of Contemporary English (LDOCE [Pro78]),
and  the  most  frequent  2000  words  in  the  Brown  corpus  of  present-day  American  English 
[EK82]. 

Since LDOCE is a learner’s dictionary, its control vocabulary (that is, the set of
words used in its definitions) can be viewed as a list of particularly basic or useful
words in English, as determined through an extensive process of lexicography. Resnik
et al. find that 78-85% of the items in the LDOCE control vocabulary are found in the
NIV.^ Examination of the LDOCE vocabulary also found in the NIV shows that Bible
text  contains  ample  vocabulary  representative  of  typical,  everyday  usage,  not  to  mention 
being  representative  of  a  wide  range  of  English  orthography,  as  illustrated  in  the  following 
50-word  random  sample: 

anyone ashamed at baby behave bit bite black blame build calm circular clay
cliff cloth contain control damage ditch educate finish fit heart insect its lid neither
particular presence press price pronounce rent seem soap stand strength
strike take thick throw tonight undo vote weave west wheel wine within worst

^The exact percentage depends on how certain cases are handled, e.g. whether or not the word
practice, found in the American-published NIV, is counted as an instance of the word practise in the
British-published LDOCE.

A similar comparison with frequent words in the Brown corpus provides evidence that
a modern-language Bible provides good coverage not only of “useful” words, but of words
that occur frequently. The Brown corpus is an oft-cited source of word frequency data
for  English,  e.g.  when  controlling  for  word  frequency  in  psychological  experiments,  and 
it  is  also  one  of  the  most  widely  used  corpora  in  natural  language  processing  research. 
Resnik  et  al.  show  that  of  the  most  frequent  2000  words  in  the  Brown  corpus,  fully  75% 
occur  in  the  NIV.  A  50-word  random  sample  includes  the  following: 

achievement address arrive brief building call climb conclude dream dry exchange
extent family fast fat happen here hide impression increase lady narrow
nine  observation  officer  opportunity  plan  please  pleasure  public  reflect  relative 
requirement  road  satisfy  search  select  silence  simple  single  spread  straight  test 
tragedy  watch  wave  west  wind  wine  work 

Because the Brown corpus spans multiple genres, it is also possible to assess vocabulary
coverage as a function of text type. Resnik et al. show that even for texts in genres
far removed from Biblical material, such as science fiction, theater and music reviews,
and  science  writing,  the  NIV  text  covers  at  least  two-thirds  of  the  most  frequent  2000 
words  in  each  genre. 
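
The coverage figures quoted above are ratios of the form "benchmark words that also occur in the Bible text, divided by the size of the benchmark list". The Python sketch below illustrates that ratio under simplifying assumptions: it lowercases the text and matches exact word forms only, whereas the procedure of Resnik et al. may treat inflections and spelling variants differently. The function names and the toy inputs are hypothetical.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase the text and return the set of alphabetic word forms."""
    return set(re.findall(r"[a-z]+", text.lower()))


def coverage(benchmark_words, corpus_text: str) -> float:
    """Fraction of the benchmark vocabulary that occurs in the corpus text.

    Assumption: exact matching of lowercased word forms, with no
    lemmatization and no spelling normalization.
    """
    benchmark = {w.lower() for w in benchmark_words}
    if not benchmark:
        raise ValueError("benchmark vocabulary must be non-empty")
    return len(benchmark & tokenize(corpus_text)) / len(benchmark)
```

For example, `coverage(["wine", "west", "atom", "cat"], "They went west and drank wine.")` evaluates to 0.5, since two of the four benchmark words occur in the text.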

Although we have not conducted a similar comparison for non-English versions of the
Bible,  it  is  reasonable  to  expect  the  results  to  carry  over:  because  the  underlying  content 
is  the  same,  one  can  expect  similar  patterns  of  vocabulary  content  in  a  modern-language 
version  of  the  Bible,  regardless  of  the  language  in  which  that  content  is  expressed. 

This parallelism of content at a global level is matched by parallelism at a much finer
grain: unlike other parallel document collections, e.g. parallel bilingual corpora used in
research on automatic machine translation, translations of the Bible contain verse-level
parallelism. This will permit a fine-grained level of analysis for OCR evaluation; for example,
some verses may be difficult in any language owing to the presence of Biblical names.
It also provides valuable parallel data for multilingual language processing applications,
such as the automatic discovery of term translations [Mel96] and construction of cross-language
information retrieval systems [LL90]. Indeed, OCR work on the Bible has the
potential to be of great benefit to the language technology community as a whole, by
providing  data  in  electronic  form  for  languages  in  which  corpus  data  is  otherwise  difficult 
to  obtain.  Uses  of  the  text  go  well  beyond  simple  acquisition  of  vocabulary:  because  it 
contains  text  by  a  large  set  of  authors,  in  a  variety  of  text  styles,  and  touching  on  a  range 
of  content  areas,  the  Bible  is  also  a  rich  data  source  for  cross-language  data  on  syntactic 
structure  and  semantic  patterning. 

In  summary,  the  advantages  of  using  the  Bible  include  the  following: 

1. It exists in a huge number of languages: as of this writing, complete Bible translations
exist in over 360 languages, New Testament translations in over 900, and at
least one book of the Bible in nearly 2200 languages.^

2. These figures are increasing rapidly: within the last year 13 new Bible translations
were completed, 25 New Testaments were completed, and 180 new translations were
initiated.

^http://www.biblesociety.org/trans-gr.html




3.  Translations  are  verse  by  verse,  providing  a  reliably  parallel  corpus. 

4.  Bibles  exist  in  print  in  various  formats,  fonts,  and  paper  types. 

5. Coverage of everyday modern language is surprisingly high: approximately 80% of
the defining vocabulary of Longman’s Dictionary of Contemporary English (which
has controlled-vocabulary definitions) can be found in a modern-English translation
of the Bible (the New International Version).

6.  Bibles  in  most  common  languages  are  available  on-line  or  in  electronic  form,  often 
free  or  for  a  reasonable  licensing  cost.  This  easily  available  groundtruth  data  frees 
us  of  much  of  the  manual  work. 

7.  The  Bible  is  a  large  corpus  by  the  standards  of  work  in  OCR,  and  non-trivial  by 
the standards of corpus-based work in natural language processing. For example,
our French version has over 1000 pages, comprising on the order of 800,000 words.

Use  of  the  Bible  as  a  language  resource  is  not  without  its  limitations,  of  course. 
Many  elements  of  modern-day  documents  are  missing  from  its  pages,  such  as  technical 
terminology,  many  modern  proper  names,  and  everyday  words  that  are  of  more  modern 
origin or simply outside its scope (e.g. atom, Buddhist, January, cat). Formats for
addresses, dates, and the like are also clearly not present. Furthermore, complex layouts
such as those found in newspapers and magazines, tables and graphics, are also absent
from the Bible corpus. (However, Biblical poetry and, in some editions, pictures do appear
in the Bible.) Thus there is a tradeoff: as a corpus the Bible provides an unmatched
degree of consistency, availability, and parallelism, at the cost of some elements that
might help distinguish between OCR systems, e.g. on the basis of the coverage of their
lexicons, or page segmentation and zone classification performance.

4  Real  Image  Dataset:  The  Scanned  Bible 

At  the  University  of  Maryland  we  have  started  a  project,  which  we  internally  refer  to 
as  Project  Gutenberg  [KKB98],  to  create  scanned  Bible  image  and  groundtruth  datasets 
for  OCR  evaluation  in  various  languages.  At  the  time  of  writing  this  article  we  have 
been able to scan the entire New Testament of the Arabic Bible [IBS88]. The scanning
was done at 600 dpi resolution, and both binary (1 bit/pixel) and grayscale (8 bits/pixel)
images were scanned. Two pages of the Bible were scanned at a time. There are 198
scanned binary images, which are saved in TIFF format using Group 4 compression, and
198 grayscale images stored in TIFF format without compression. The zone groundtruth
was generated manually using the PinkPanther software [YV98, RV94]. The attributes
of the zones, namely id, bounding box, type (body, running head, section title, page
number), etc., are stored in a zone groundtruth file. We did not have to type in the
electronic text groundtruth; it was available on the International Bible Society web
page http://www.gospelcom.net/ibs. The text encoding format is CP1256. The text
groundtruth corresponding to a particular zone, however, had to be extracted from the
electronic text groundtruth files separately. We are currently working on a method to
make this process automatic.
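
Because the text groundtruth is encoded in CP1256 (the Windows Arabic code page), tools that expect Unicode need to decode it first. The following Python sketch shows one way to convert such data to UTF-8 using the standard `cp1256` codec. The function name is hypothetical, and the sketch says nothing about how the project itself processes the files.

```python
def cp1256_to_utf8(data: bytes) -> bytes:
    """Decode CP1256-encoded bytes and re-encode them as UTF-8.

    Raises UnicodeDecodeError if a byte has no CP1256 mapping.
    """
    return data.decode("cp1256").encode("utf-8")


# Round trip on a short Arabic string (written with escapes so that the
# example itself does not depend on the encoding of this document).
sample = "\u0633\u0644\u0627\u0645"
converted = cp1256_to_utf8(sample.encode("cp1256"))
```

Decoding yields characters in logical (storage) order, so any right-to-left display handling is left to the viewer.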

In Figure 1 we show a binary scanned image from the Arabic Bible. Manually delineated
zones for this image are shown in Figure 2. The zones are non-overlapping
rectangles and are ordered in Arabic reading order. A section of the zone groundtruth
file storing the attributes of a zone is shown in Figure 3. The details of the database
design, file formats, directory structures, etc. are available in a separate article [KKB98].



Figure  1:  Scanned  image  of  an  Arabic  Bible  page. 


5  Synthetic  Image  Dataset:  Model-Based  Degradation 

We first describe a degradation model proposed by Kanungo et al. [KHP94, KHP93,
Kan96] that can be used to generate synthetically degraded documents. The degradations
produced by this model are local: they are typical of the degradations that appear while
scanning flat paper. Next we use the model to generate degraded images of French Bible pages.
The same process can be used for generating degraded Bible images in any language. A
more general model that accounts for perspective distortions near the spine of a thick
book is described in Kanungo et al. [KHP94] and not discussed in this article.

Figure 2: Zone groundtruth for the scanned image shown in Figure 1. Notice that the
zones are rectangular and non-overlapping. The zones were created using the PinkPanther
groundtruthing tool [YV98, RV94].


5.1  A  Degradation  Model 

In  this  section  we  present  a  model  that  accounts  for  (i)  pixel  inversion  (from  foreground  to 
background  and  vice  versa)  that  occurs  independently  at  each  pixel  due  to  light  intensity 
fluctuation,  sensitivity  of  the  sensors,  and  thresholding  level,  and  (ii)  blurring  that  occurs 
due  to  the  point-spread  function  of  the  scanner’s  optical  system. 

The degradation model has six parameters: $\Theta = (\eta, \alpha_0, \alpha, \beta_0, \beta, k)^t$. We model the
pixel-flipping probability of a pixel as an exponentially decaying function of
the distance $d$ from the nearest boundary pixel. The foreground and background 4-neighbor
distance of each pixel is computed using a distance transform algorithm (see
Borgefors [Bor86]). The parameters $\alpha_0$ and $\alpha$ control the probability of a foreground pixel
switching to a background pixel, and $\beta_0$ and $\beta$ control the probability of a background
pixel switching to a foreground pixel. The parameter $\eta$ is the constant probability of
flipping for all pixels. Finally, the parameter $k$, which is the size of the disk used in the
morphological closing operation, accounts for the correlation introduced by the point-spread
function of the optical system. These parameters are used to degrade an ideal
binary image as follows.

1. Compute the distance $d$ of each pixel from the character boundary.

2. Flip each foreground pixel with probability

$p(0 \mid 1, d, \alpha_0, \alpha) = \alpha_0 e^{-\alpha d^2} + \eta.$

3. Flip each background pixel with probability

$p(1 \mid 0, d, \beta_0, \beta) = \beta_0 e^{-\beta d^2} + \eta.$

4. Perform a morphological closing operation with a disk structuring element of diameter
$k$. (See Haralick et al. [HSZ87] for an introduction to morphological image
processing.)
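
The four steps above can be sketched in plain Python as follows. This is an illustration, not the DDM program. It assumes that the image is a list of rows of 0/1 values with 1 as foreground, that the flipping probabilities have the squared-distance form $\alpha_0 e^{-\alpha d^2} + \eta$ and $\beta_0 e^{-\beta d^2} + \eta$, that the 4-neighbor distance is computed by a two-pass city-block transform, and that the disk of diameter $k$ is discretized as all offsets within $k/2$ of the center. All function and parameter names are hypothetical.

```python
import math
import random


def cityblock_distance(img, target):
    """4-neighbor distance from each pixel to the nearest pixel equal to `target`."""
    h, w = len(img), len(img[0])
    far = h + w  # larger than any city-block distance inside the image
    d = [[0 if img[y][x] == target else far for x in range(w)] for y in range(h)]
    for y in range(h):                      # forward pass: top-left to bottom-right
        for x in range(w):
            if y > 0:
                d[y][x] = min(d[y][x], d[y - 1][x] + 1)
            if x > 0:
                d[y][x] = min(d[y][x], d[y][x - 1] + 1)
    for y in reversed(range(h)):            # backward pass: bottom-right to top-left
        for x in reversed(range(w)):
            if y < h - 1:
                d[y][x] = min(d[y][x], d[y + 1][x] + 1)
            if x < w - 1:
                d[y][x] = min(d[y][x], d[y][x + 1] + 1)
    return d


def disk_offsets(k):
    """Assumed discretization of a disk of diameter k: offsets within k/2 of the center."""
    r = k // 2
    return [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)
            if 4 * (dy * dy + dx * dx) <= k * k]


def dilate(img, offs):
    h, w = len(img), len(img[0])
    return [[int(any(img[y + dy][x + dx] for dy, dx in offs
                     if 0 <= y + dy < h and 0 <= x + dx < w))
             for x in range(w)] for y in range(h)]


def erode(img, offs):
    # Offsets falling outside the image are ignored, so borders are not eaten away.
    h, w = len(img), len(img[0])
    return [[int(all(img[y + dy][x + dx] for dy, dx in offs
                     if 0 <= y + dy < h and 0 <= x + dx < w))
             for x in range(w)] for y in range(h)]


def degrade(img, eta, a0, a, b0, b, k, seed=0):
    """Apply steps 1-4: distances, random flips, then morphological closing."""
    rng = random.Random(seed)
    d_fg = cityblock_distance(img, 0)   # foreground pixels: distance to background
    d_bg = cityblock_distance(img, 1)   # background pixels: distance to foreground
    flipped = []
    for y, row in enumerate(img):
        out = []
        for x, v in enumerate(row):
            if v:
                p = a0 * math.exp(-a * d_fg[y][x] ** 2) + eta
            else:
                p = b0 * math.exp(-b * d_bg[y][x] ** 2) + eta
            out.append(1 - v if rng.random() < p else v)
        flipped.append(out)
    offs = disk_offsets(k)
    return erode(dilate(flipped, offs), offs)   # closing = dilation, then erosion
```

Setting `a0 = b0 = eta = 0` and `k = 1` returns the input unchanged, which is a convenient sanity check before varying one parameter at a time.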

The noise-free documents are typeset using the LaTeX formatting system [Lam86,
Knu88]. The ASCII files containing the text and the LaTeX typesetting information
are then converted into a device-independent format (DVI) using LaTeX. A software
program called DVI2TIFF, which is a modified version of a DVI file previewer called
XDVI [V+90], is run to produce one bit/pixel binary images in TIFF format from the
DVI files. Besides producing binary images of the documents, DVI2TIFF also produces
groundtruth information regarding each character in the document image.

The  local  document  degradation  model  itself  is  another  software  program  called  DDM. 
This program takes as input an ideal binary document image in TIFF format, and a file
containing the degradation model parameter values, and produces the binary degraded
images in TIFF format. Both programs, DVI2TIFF and DDM, are implemented in
the C language and have been tested on SUN and IBM machines running the UNIX
operating system. The software is available on the UW CD-ROM-1 [HP+], the University
of Washington English Document Database I, and the model itself has appeared
in the literature [KHP94, KHP93, Kan96]. The application of the steps of our model is
shown in Figure 4. Issues regarding model validation and model parameter estimation
have been discussed elsewhere by Kanungo et al. [KHP94, KHB+94, KBH95].

5.2  Application  to  Bibles 

Since the electronic text files for some versions of the Bible are available at no cost, we
decided to format the French Bible using LaTeX [Lam86, Knu88] and then generated a
typeset  Bible.  We  degraded  this  image  synthetically.  In  Figure  5  we  show  a  synthetically 
degraded  image  of  a  page  from  a  French  Bible.  In  Figure  6  the  output  of  the  OCR 
processing is shown. OmniPage 8.0 with the French lexicon was used for generating this
text.  The  ASCII  text  generated  by  this  process  resembles  OCR  results  on  real  images 
much  more  closely  than  does  text  generated  by  simulating  single-character  errors. 

We  have  created  degraded  images  of  the  entire  New  Testament  of  the  French  Bible 
and  are  in  the  process  of  creating  degraded  images  of  Bibles  in  other  languages. 

6  Summary


We  have  described  a  project  to  create  datasets  for  multilingual  OCR  evaluation.  The 
key  idea  is  to  use  the  Bible  as  a  corpus  in  each  language.  The  Bible  is  well  suited  for 
this  purpose  for  a  variety  of  reasons:  the  printed  Bible  is  available  in  a  huge  range  of 
languages,  the  text  groundtruth  is  already  available  for  many  languages,  the  linguistic 
properties  of  the  modern  Bible  are  close  to  those  of  the  current-day  language,  Bibles 
exist  in  a  variety  of  layouts  and  fonts,  and  the  corpus  is  quite  large.  We  also  showed 
several  scanned  images  from  our  Arabic  Bible  image  dataset  and  the  corresponding  zone 
groundtruth. We also demonstrated that synthetically degraded images of the Bible
can be generated by using a model-based degradation approach. We hope to generate
similar  Bible  datasets  for  other  languages,  and  we  are  exploring  alternative  corpora  with 
similar  properties  such  as  the  Koran  and  the  Bhagavad  Gita. 
BEGIN.IMAGE
FILENAME /tapas/Gutenberg/arabic/bw/tif600/ar061.tif
IMAGE.WIDTH 6600
IMAGE.HEIGHT 5096
IMAGE.XRES 600
IMAGE.YRES 600
END.IMAGE

BEGIN.REGIONS
R.TYPE TEXT
R.SUBTYPE RUNNING.HEAD
R.NUMBER 1
R.ATTACHMENT 0
R.PARENT 0
R_DRAW_TYPE Box
R.NAME Zone1
R.ATTRIBUTES BODY
REGION.POLYGON 500 632 5712 6408

R.TYPE TEXT
R.SUBTYPE PAGE.NR
R.NUMBER 2
R.ATTACHMENT 0
R.PARENT 0
R_DRAW_TYPE Box
R.NAME Zone2
R.ATTRIBUTES BODY
REGION.POLYGON 516 612 4952 5124

R.TYPE TEXT
R.SUBTYPE BODY
R.NUMBER 3
R.ATTACHMENT 0
R.PARENT 0
R_DRAW_TYPE Box
R.NAME Zone3
R.ATTRIBUTES BODY
REGION.POLYGON 672 1160 5076 6424


Figure 3: Part of the zone groundtruth file for the zones shown in Figure 2. This file was
generated by the PinkPanther groundtruthing tool [YV98, RV94].
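
A zone groundtruth file of this kind could be read along the following lines. This is a minimal sketch that assumes only the layout visible in Figure 3: one key followed by its value per line, an image header between the BEGIN and END markers, and region records that each start with an R.TYPE line. The function name `parse_zone_groundtruth` is hypothetical and is not part of PinkPanther. Because the excerpt prints keys with both periods and underscores (R.TYPE but R_DRAW_TYPE), the sketch normalizes periods in keys to underscores; values are left untouched. The meaning of the four REGION.POLYGON numbers is not stated in the excerpt, so they are simply kept as a list of integers.

```python
def parse_zone_groundtruth(text):
    """Parse a zone groundtruth listing into (image_header, regions).

    Assumed layout (inferred from the excerpt only): 'KEY value' lines,
    an image header section, then region records starting at R.TYPE.
    """
    image = {}
    regions = []
    current = None
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information
        parts = line.split(None, 1)
        key = parts[0].replace(".", "_").upper()  # R.TYPE and R_TYPE compare equal
        value = parts[1].strip() if len(parts) > 1 else ""
        if key == "BEGIN_IMAGE":
            section = "image"
        elif key == "BEGIN_REGIONS":
            section = "regions"
        elif key in ("END_IMAGE", "END_REGIONS"):
            section = None
        elif section == "image":
            image[key] = value
        elif section == "regions":
            if key == "R_TYPE":
                # Each region record is assumed to open with its type line.
                current = {}
                regions.append(current)
            if current is not None:
                current[key] = value
    for region in regions:
        if "REGION_POLYGON" in region:
            region["REGION_POLYGON"] = [int(n) for n in region["REGION_POLYGON"].split()]
    return image, regions
```

Calling `parse_zone_groundtruth(open(path).read())` on a file like the one above would yield the header fields as strings and one dictionary per zone.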
Figure 4: Local document degradation model: (a) Ideal noise-free character; (b) Distance
transform of the foreground; (c) Distance transform of the background; (d) Result of
the random pixel-flipping process. The probability of a pixel flipping is
$P(0 \mid d, \beta, f) = P(1 \mid d, \alpha, b) = \alpha_0 e^{-\alpha d^2}$; here
$\alpha = \beta = 2$, $\alpha_0 = \beta_0 = 1$; (e) Morphological closing of the
result in (d) by a 2 × 2 binary structuring element.
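
The steps in Figure 4 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the distance d of a pixel is taken to be the Euclidean distance to the nearest pixel of the opposite color, each pixel flips with probability scale * exp(-decay * d^2), and the closing uses a 2 x 2 element anchored at its top-left corner. All function and parameter names (`flip_pixels`, `close_2x2`, `degrade`, `fg_scale`, and so on) are hypothetical. The image is a list of rows of 0/1 values, and the distance search is brute force to keep the sketch dependency-free; a real pipeline would use an array library.

```python
import math
import random

# Offsets covered by a 2 x 2 structuring element anchored at its top-left pixel.
SE_2X2 = [(0, 0), (0, 1), (1, 0), (1, 1)]


def squared_distance_to(img, r, c, target):
    """Squared Euclidean distance from (r, c) to the nearest pixel equal to target."""
    best = None
    for rr, row in enumerate(img):
        for cc, value in enumerate(row):
            if value == target:
                d2 = (rr - r) ** 2 + (cc - c) ** 2
                if best is None or d2 < best:
                    best = d2
    return best  # None if the image contains no such pixel


def flip_pixels(img, fg_scale=1.0, fg_decay=2.0, bg_scale=1.0, bg_decay=2.0, rng=None):
    """Flip each pixel with probability scale * exp(-decay * d^2) (assumed form)."""
    if rng is None:
        rng = random.Random()
    out = [row[:] for row in img]
    for r, row in enumerate(img):
        for c, value in enumerate(row):
            d2 = squared_distance_to(img, r, c, 1 - value)
            if d2 is None:
                continue  # uniform image: no opposite pixel, so no flipping
            scale, decay = (fg_scale, fg_decay) if value == 1 else (bg_scale, bg_decay)
            if rng.random() < scale * math.exp(-decay * d2):
                out[r][c] = 1 - value
    return out


def close_2x2(img):
    """Morphological closing (dilation, then erosion) with a 2 x 2 element."""
    h, w = len(img), len(img[0])
    # Dilate into a grid one pixel larger so nothing is clipped at the border.
    dilated = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        for c in range(w):
            if img[r][c]:
                for dr, dc in SE_2X2:
                    dilated[r + dr][c + dc] = 1
    # Erode: keep a pixel only if the whole element fits inside the dilation.
    return [[int(all(dilated[r + dr][c + dc] for dr, dc in SE_2X2)) for c in range(w)]
            for r in range(h)]


def degrade(img, rng=None, **params):
    """Pixel flipping followed by closing, mirroring steps (d) and (e) of Figure 4."""
    return close_2x2(flip_pixels(img, rng=rng, **params))
```

Passing a seeded generator, for example `degrade(glyph, rng=random.Random(0))`, makes a degraded image reproducible, and varying the scale and decay parameters changes the degradation level.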
[Image: synthetically degraded page from a French Bible]

Figure 5: An artificially degraded text document image. The text is a page from a Bible
in the French language. The layout was formatted using LaTeX and degraded using the
model described in the article.

81  Hesbou  et  sa  banlieue,  et  3  aesar  et  sa  ban-
lieue.

chapter  7 

1  Fils  dlwacar:  17hola,  Pua,  Jwchub  et 
Schimron,  quatre. 

2  Fils  de  Thola:  Usai,  Rephaja,  Jeriel, 
Jachmai,  Jibeara  et  Samuel,  chef  deb  maisons 
de  leurs  peres,  de  1101a,  vaillants  honunes 
dam  leurs  generations;  leur  nombre,  du  tempa 
de  David,  etait  de  v  deuX  mille  six  cents. 

3  Fils  dTssi:  Jisrad  ja  Fils  de  Jisrad  ja: 

Mica,  Abdias,  Joa,  JIs  ,  en  tout  cinq 
Chxa; 

4  ils  avaient  avec  eux,  selon  leurs  generations, 
9”  lets  maisons  de  lem  peres,  trente-six 
mille  bomm  de  troupes  armes  pour  la 
guerre,  car  il  avaient  beaucoup  de  fesmues  et 
de  his. 

5  leurs  fteres,  d’apres  toutes  let  funilles 
dimacar,  honunea  vaillants,  formaient  un  tj^. 
W  de  quatre, -vingt-eept  mille,  enregistree  dam 
Im  geneaktw 

6  Fils  de  Benjamin:  Bela,  Beler  et  Jedia, 
trOW 

7  Fils  de  Ba&-  Etsbon,  Uui,  Usiel,  Jeri- 


Figure  6:  OCR  text  for  the  synthetically  degraded  French  Bible  page  image  shown  in 
Figure  5.  We  see  that  image  degradation  models  can  be  used  to  simulate  the  text  output 
produced by OCR systems. This methodology allows us to easily create such simulations
for text in any language, in arbitrary layouts and fonts, and at varying degradation levels.
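
Once groundtruth text is available, OCR output such as that in Figure 6 can be scored against it. The sketch below shows one common way to do this, character accuracy derived from the Levenshtein edit distance; it is an assumption made for illustration, since the evaluation metric is not specified here. The function names are hypothetical, and the reference line in the usage example is typed in as an assumed groundtruth for the first verse of the page, not taken from the dataset.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                         # delete a reference character
                cur[j - 1] + 1,                      # insert a hypothesis character
                prev[j - 1] + (ref_char != hyp_char),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def character_accuracy(ref, hyp):
    """1 - (edit distance / reference length); can be negative for very poor output."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    reference = "1 Fils d'Issacar: Thola, Pua, Jaschub et"   # assumed groundtruth line
    ocr_line = "1 Fils dlwacar: 17hola, Pua, Jwchub et"       # line from Figure 6
    print(f"character accuracy: {character_accuracy(reference, ocr_line):.3f}")
```

Because the Bible text is closely parallel across languages, the same scoring code could be run verse by verse on each language's dataset.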